Abstract

The Jagiellonian-PET (J-PET) collaboration is developing a prototype TOF-PET
detector based on long polymer scintillators. This novel approach exploits
the excellent time properties of plastic scintillators, which permit very
precise time measurements. The very fast FPGA-based front-end electronics and
data acquisition system, as well as the low- and high-level reconstruction
algorithms, were developed specifically for the J-PET scanner. TOF-PET data
processing and reconstruction are time- and resource-demanding operations,
especially for a large-acceptance detector that works in triggerless data
acquisition mode. In this article, we discuss the parallel computing methods
applied to optimize the data processing for the J-PET detector. We begin with
general concepts of parallel computing and then discuss several applications
of those techniques in the J-PET data processing.
1. Introduction


The Jagiellonian-PET (J-PET) collaboration is developing a prototype TOF-PET
detector based on plastic scintillators [1, 2, 3, 4, 5, 6, 7, 8]. The detector
is a cylinder made of long scintillator strips. Its large acceptance allows for full
3-D image reconstruction. The main advantage of the J-PET solution is its
excellent time resolution (see e.g. results in [3]), which makes it suitable not
only for medical purposes but also for precise studies of the discrete symmetries
in positronium systems [9]. TOF-PET data processing and reconstruction are
time- and resource-demanding operations, especially in the case of the
large-acceptance J-PET detector, which works in the so-called triggerless mode,
in which all events (digitized times and amplitudes) from the front-end
electronics (FEE) are stored to disk without any master trigger condition
applied [10]. Next, the collected raw data undergo low- and high-level
reconstruction. The registered data are first transformed into hit positions in
the scintillator modules, and in the next step the hits are combined to form
Lines of Response (LORs). In the last stage, image reconstruction procedures are
used to obtain the final image from the set of LORs. In order to process this
high data stream efficiently, parallel computing techniques have been applied at
several levels of the data collection and reconstruction.
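The step from hits to LORs can be illustrated with a minimal sketch. It is
not the J-PET implementation: the Hit fields, the greedy neighbour pairing and
the coincidence window are assumptions made for this example only. It pairs
time-ordered hits that fall within a coincidence window and uses the measured
time difference to estimate how far the annihilation point lies from the LOR
midpoint (c * dt / 2).

```python
from dataclasses import dataclass
from typing import List, Tuple

C_CM_PER_NS = 29.9792458  # speed of light in cm/ns


@dataclass
class Hit:
    """Hypothetical reconstructed hit: time in ns, position in cm."""
    time_ns: float
    x_cm: float
    y_cm: float
    z_cm: float


def pair_hits(hits: List[Hit], window_ns: float) -> List[Tuple[Hit, Hit]]:
    """Greedily pair time-ordered neighbouring hits inside the coincidence window."""
    ordered = sorted(hits, key=lambda h: h.time_ns)
    lors = []
    i = 0
    while i + 1 < len(ordered):
        first, second = ordered[i], ordered[i + 1]
        if second.time_ns - first.time_ns <= window_ns:
            lors.append((first, second))
            i += 2  # both hits are consumed by this LOR candidate
        else:
            i += 1  # the earlier hit has no partner; move on
    return lors


def tof_offset_cm(first: Hit, second: Hit) -> float:
    """Distance of the annihilation point from the LOR midpoint, toward the earlier hit."""
    return 0.5 * C_CM_PER_NS * (second.time_ns - first.time_ns)
```

Each pair is built from its own two hits only, so disjoint time slices of the
hit stream could be paired independently of each other.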


2. Parallel processing 

Parallel processing can be defined as a type of computation in which a task
is divided into independent subtasks, which are then computed simultaneously
by several computing resources. The results of the individual computations are
then merged. Parallelization techniques can be classified according to several
criteria. Instruction-level parallelization, for example, corresponds to the
simultaneous execution of several operations in a computer program. In the
case of data parallelization, the data set is distributed among many computing
nodes, while in the case of task parallelization the code is divided into
threads that are executed across the computing nodes. Typically, to take
advantage of parallelization, the software procedures must be designed in a
special way, e.g. by using dedicated programming environments and libraries
such as MPI [11], OpenMP [12] or CUDA [13]. An overview of different
parallelization techniques can be found in [14]. In the past, parallel
processing was the domain of high-performance computing on supercomputers.
However, thanks to the rapid improvement in overall CPU performance at
relatively low prices, and to the introduction of new techniques such as
multi-core processors, parallelization has become more accessible and popular
in many different fields. Apart from CPU processing, even more efficient
technologies such as Graphics Processing Units (GPUs) and Field-Programmable
Gate Arrays (FPGAs) have recently gained a lot of attention. In the J-PET
project, parallelization with multi-core CPUs, GPUs and FPGAs is used at
different stages of the data processing.
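The divide, compute and merge pattern defined above can be sketched in a few
lines. The example is illustrative only and is not taken from the J-PET code:
the function names and the sum-of-squares workload are assumptions, and a
thread pool from the Python standard library stands in for the computing
resources. For CPU-bound work in Python a process pool
(concurrent.futures.ProcessPoolExecutor) would normally replace the thread
pool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split(data: Sequence[float], n_chunks: int) -> List[Sequence[float]]:
    """Divide the data set into at most n_chunks nearly equal, independent parts."""
    size = max(1, (len(data) + n_chunks - 1) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]


def partial_sum_of_squares(chunk: Sequence[float]) -> float:
    """Subtask: depends only on its own chunk, so chunks can run concurrently."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data: Sequence[float], n_workers: int = 4) -> float:
    """Data parallelization: same operation on every chunk, then merge the results."""
    chunks = split(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum_of_squares, chunks))
    return sum(partials)  # merge step
```

The merge step is cheap here because the partial results are plain numbers; in
general the cost of merging limits how finely a task can usefully be divided.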


3. FPGA processing in FEE and Data Acquisition System 

An FPGA is a programmable silicon chip that combines two important features.
On the one hand, the FPGA is reprogrammable, so any logic can be implemented,
and changed if needed, in hardware description languages such as Verilog or
VHDL. On the other hand, the compiled program is translated into a set of
physical connections between the logic arrays, so it is a genuine hardware
realization of the designed logic, with real-time processing speed analogous
to that offered by dedicated ASIC processors. Finally, FPGA chips are well
suited to parallelization and very cost-effective. The FPGA devices are the
core computing nodes of the J-PET FEE and Data Acquisition System (DAQ) [15].
The J-PET FEE was designed to sample very fast signals, with a rise time of
about 1 ns, in the voltage domain at many levels [16]. A novel technique for
precise measurement of time and charge is based solely on FPGA devices and a
few satellite discrete electronic components. One computing board (called the
Trigger Readout Board, TRB) consists of five Lattice ECP3-150 FPGAs. Four
FPGAs are used as time-to-digital converters and one as a central FPGA node
that steers the whole board. Multiple computing boards are interconnected via
network concentrators. The global time synchronization is provided through a
reference channel. The J-PET DAQ system allows for continuous data recording
over the whole measurement period. In total, more than 500 channels with
1 Gb/s data rates can be read. The overall constant read-out rate is 50 kHz,
with the dead time reduced to the level of tens of ns. The described
triggerless mode of operation allows every event to be stored without
information loss due to preliminary selection. On the other hand, a
significant amount of disk storage is needed to save the data (on the order of
terabytes per measurement), whereas most of the currently registered events
contain only useless noise information. In order to reduce the data flow and
to eliminate background events, a new Central Controller Module (CCM) is
introduced as an intermediate computing node between the TRB boards and the
disk storage. The CCM is being developed on the basis of a Xilinx Zynq chip,
which contains an FPGA integrated with an ARM processor. It is capable of
processing up to 16 Gbit Ethernet streams in parallel in hardware, as well as
of online filtering of the data. Moreover, it is even possible to implement
some online reconstruction algorithms. Finally, online monitoring with a
dedicated data substream will be added.
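Sampling a signal in the voltage domain at several levels can be pictured in
software as follows. This is only a sketch of the idea, not the firmware
logic: the function names are hypothetical, the pulse is assumed to be
positive-going, and the waveform is assumed to be available as discrete
(time, voltage) samples in ns and mV. For each threshold it returns the
interpolated time at which the leading edge first crosses that level.

```python
from typing import List, Optional, Sequence


def crossing_time(times: Sequence[float], volts: Sequence[float],
                  threshold: float) -> Optional[float]:
    """Time at which the signal first rises through threshold (linear interpolation)."""
    for i in range(1, len(volts)):
        v0, v1 = volts[i - 1], volts[i]
        if v0 < threshold <= v1:  # rising crossing between samples i-1 and i
            frac = (threshold - v0) / (v1 - v0)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    return None  # the signal never reaches this level


def sample_leading_edge(times: Sequence[float], volts: Sequence[float],
                        thresholds: Sequence[float]) -> List[Optional[float]]:
    """One crossing time per threshold; each level is evaluated independently."""
    return [crossing_time(times, volts, level) for level in thresholds]
```

Because each threshold and each channel is handled independently of the
others, this kind of measurement maps naturally onto parallel hardware.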

4. Data parallelization in the low-level reconstruction framework 

The raw data stored on disk are processed in the J-PET framework, a
programming environment that provides useful tools for various reconstruction
algorithms and calibration procedures and that standardizes common operations
such as input/output. It also provides the necessary information about run
conditions, geometry and electronics setups by communicating with the
parameter database. The architecture of the analysis framework has already
been described elsewhere. In this section we describe the parts that are
important for understanding the framework parallelization. In the J-PET
framework, the analysis chains are decomposed into series of standardized
modular blocks. Each module corresponds to a particular computing task, e.g. a
reconstruction algorithm or a calibration procedure, with defined input and
output methods. The processing chain is built by registering the chosen
modules in the JPetManager, which is responsible for synchronizing the data
flow between the modules. The framework parallelization is
implemented using the PROOF [18] (Parallel ROOT Facility) extension for
the ROOT library [19]. PROOF enables parallel file processing on a cluster of
computers or on many-core machines. In the case of the J-PET framework,
multi-core processing has been tested. Two options are being developed. The
first is a realization of data-parallel computing. A set of chosen computing
tasks, in the form of a processing chain, is registered in the JPetManager as
described above. The same processing chain is then replicated and executed in
parallel for every input file provided. This approach assumes that the input
files can be analyzed independently. In the second mode, a single processing
chain can contain modules (subtasks) that operate in parallel. This solution
is currently being implemented.
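The first, data-parallel mode can be sketched as follows. The code does not
use the JPetManager or PROOF interfaces: ChainManager and its methods are
hypothetical stand-ins, modules are plain functions, in-memory lists play the
role of input files, and a thread pool from the Python standard library
replaces the PROOF workers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Module = Callable[[list], list]  # a module maps a list of events to a new list


class ChainManager:
    """Hypothetical manager: registers modules and runs them in order."""

    def __init__(self) -> None:
        self._modules: List[Module] = []

    def register(self, module: Module) -> None:
        self._modules.append(module)

    def run_chain(self, events: list) -> list:
        """Pass the events of one input file through every registered module."""
        for module in self._modules:
            events = module(events)
        return events

    def run_files(self, files: Dict[str, list], n_workers: int = 4) -> Dict[str, list]:
        """Run the same chain on every input file; files are assumed independent."""
        names = list(files)
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            results = pool.map(self.run_chain, (files[name] for name in names))
            return dict(zip(names, results))
```

For instance, registering a noise filter followed by a calibration step and
calling run_files on two inputs processes both with identical chains; no state
is shared between files, which is the independence assumption stated above.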


5. Parallelization at the image reconstruction level 


The final output of the low-level reconstruction phase is a reconstructed set
of LORs, which is provided as the input to the image reconstruction
procedures. The most popular approach, based on iterative algorithms derived
from Maximum Likelihood Expectation Maximization (MLEM), has been adopted. The
available time-of-flight information is incorporated to improve the accuracy
and quality of the reconstruction. In order to reduce the processing time,
parallelization techniques are applied. Currently, two implementations are
used. The first exploits the processing capability of Graphics Processing
Units (GPUs). An efficient image reconstruction using the list-mode MLEM
algorithm with approximation kernels was implemented for GPUs [22]. Here, the
CUDA platform was adopted. The second approach is a full 3-D reconstruction
based on a multi-core CPU architecture [21]. In this case, the most
time-consuming operations, such as projection and back-projection, are
parallelized. The code is based on the OpenMP library. For the current test
implementation, one MLEM iteration, processed on 40 cores with 128 GB of
memory, takes about 70 minutes when using the large field of view
(88 cm x 88 cm x 50 cm) with a binning of 0.5 cm and 1 degree. Typically,
about 10 iterations are enough to reach the optimal MLEM reconstruction point.
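The structure of one MLEM iteration, and the place of the projection and
back-projection steps in it, can be seen in the following minimal sketch. It
is a textbook binned MLEM update on a small dense system matrix, written for
illustration only: it is neither the list-mode GPU code nor the OpenMP code
referred to above, it ignores time-of-flight information, and all names are
assumptions.

```python
from typing import List

Matrix = List[List[float]]  # a[i][j]: probability that pixel j contributes to LOR bin i


def forward_project(a: Matrix, image: List[float]) -> List[float]:
    """Projection: expected counts in every LOR bin for the current image."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, image)) for row in a]


def mlem_iteration(a: Matrix, counts: List[float], image: List[float]) -> List[float]:
    """One MLEM update: x_j <- x_j / s_j * sum_i a_ij * y_i / (A x)_i."""
    n_bins, n_pix = len(a), len(image)
    expected = forward_project(a, image)
    ratios = [y / e if e > 0 else 0.0 for y, e in zip(counts, expected)]
    # Back-projection of the measured/expected ratios into image space.
    back = [sum(a[i][j] * ratios[i] for i in range(n_bins)) for j in range(n_pix)]
    # Sensitivity s_j: total detection probability of pixel j.
    sens = [sum(a[i][j] for i in range(n_bins)) for j in range(n_pix)]
    return [x * b / s if s > 0 else 0.0 for x, b, s in zip(image, back, sens)]
```

Every element of the projection depends on a single LOR bin and every element
of the back-projection on a single pixel, so both loops can be distributed
over cores or GPU threads. With the figures quoted above, about 10 iterations
at about 70 minutes each correspond to roughly 700 minutes, i.e. close to 12
hours, per image for the test configuration.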


6. Summary and Outlook 

In order to reduce the processing time of the data flow, we use parallel
computing at several stages. We presented the solution implemented at the FEE
and DAQ level, which is based on FPGA chips. In addition, multi-core CPU-based
and GPU-based algorithms are used for the low-level and high-level
reconstruction. Work is currently ongoing to further reduce the processing
time, e.g. by implementing online event filters. Apart from the presented
computing schemes, in which the data processing is performed locally, several
remote processing concepts are being considered as a replacement for
traditional on-site computing. The basic idea is to carry out the
resource-heavy computations remotely by using cloud or grid computing.